Taku OHSAWA Koji KAI Kazuaki MURAKAMI
In merged DRAM/logic LSIs, it is necessary to reduce the number of DRAM refreshes because of higher heat dissipation caused by the logic portion on the same chip. In order to overcome this problem, we propose several DRAM refresh architectures. The basic is to eliminate unnecessary DRAM refreshes. In addition to this, we propose a method for reducing the number of DRAM refreshes by relocating data. In order to evaluate these architectures and method, we have estimated the DRAM refresh count in executing benchmark programs under several models which simulate each combination of them. As a result, in the most effective combination, we have obtained more than 80% reduction against a conventional DRAM refresh architecture for most of benchmark programs. In addition to it, we have taken normal DRAM access into account, even then we have obtained more than 50% reduction for several benchmarks.
Hyacinthe NZIGOU MAMADOU Takeshi NANRI Kazuaki MURAKAMI
The paper presents a novel approach to estimate the performance of MPI collective communications. Our objective is to help researchers to make appropriate decisions on their message-passing applications. For each collective communication, we attempt to apply LogGP and P-LogP standard point-to-point models. The resulted models are compared with the empirical data in order to identify the most suitable for performance characterization of collective operations. For the communications on large clusters with large size messages, the network contention problem can significantly affect the performance. Hence, to reduce the relative gap between the prediction and the measured runtime, the contention issue is also modeled, by a queuing theory analysis method, and taken in account with the total performance estimation. The experiments performed on a cluster which consists of 64 processors interconnected by Gigabit Ethernet network show encouraging results. For any collective operation, given a number of processors and a range of message sizes, there is at least one model that predicts the performance precisely. We could achieve a gap between the predicted and the measured run-time around 15%. Thus, by handling the contention problem, we could reduce around 80% of the relative gap.
Koji INOUE Koji KAI Kazuaki MURAKAMI
This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only on-chip DRAM subarrays corresponding to a replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (Energy Delay) product by more than 75% over a conventional memory-path model.
Koji INOUE Tohru ISHIHARA Kazuaki MURAKAMI
This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called way-predicting set-associative cache, speculatively selects a single way, which is likely to contain the data desired by the processor, from the set designated by a memory address, before it starts a normal cache access. By accessing only the single way predicted, instead of accessing all the ways in a set, energy consumption can be reduced. In order for the way-predicting cache to perform well, accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.
Makoto SUGIHARA Tohru ISHIHARA Kazuaki MURAKAMI
This paper proposes a soft-error model for accurately estimating reliability of a computer system at the architectural level within reasonable computation time. The architectural-level soft-error model identifies which part of memory modules are utilized temporally and spatially and which single event upsets (SEUs) are critical to the program execution of the computer system at the cycle accurate instruction set simulation (ISS) level. The soft-error model is capable of estimating reliability of a computer system that has several memory hierarchies with it and finding which memory module is vulnerable in the computer system. Reliability estimation helps system designers apply reliable design techniques to vulnerable part of their design. The experimental results have shown that the usage of the soft-error model achieved more accurate reliability estimation than conventional approaches. The experimental results demonstrate that reliability of computer systems depends on not only soft error rates (SERs) of memories but also the behavior of software running in computer systems.
Kazuaki MURAKAMI Morihiro KUGA Oubong GWUN Shinji TOMITA
Superscalar processors can improve uniprocessor performance further byond RISC performance by exploiting spatial instruction-level parallelism. Superscalar processor design presents more opportunities for tradeoffs than conventional RISC design. In order to utilize processor resources augmented by the superscalar approaches, processors must be carefully designed and implemented. This paper examines the various aspects of superscalar processors and discusses the design features and tradeoffs. Specific aspects of superscalar processors that are examined include: instruction fetch boundary, instruction-cache line crossing, branch prediction, data-hazard resolution, control-hazard resolution, and precise or imprecise interrupts. This paper uses a superscalar simulator that modeled a DDU (Dynamically-hazard-resolved, Dynamic-code-scheduled, Uniform) superscalar architecture, called SIMP (Single Instructions stream/Multiple instruction Pipelining), and evaluate many different SIMP hardware organizations. This paper concludes that a superscalar processor can increase the performance with major five hardwary features: instruction aligning, branch prediction with branch-target buffer, code scheduling, speculative execution with conditional mode, and imprecise interrupts. However, the first three functions are claimed to be performed by compilers rather than by hardware.
Katsuhiko METSUGI Kazuaki MURAKAMI
TLSP (Thread-Level Speculative Parallel processing) architecture is a growing processor architecture. Parallelism of a program executed on this architecture is ruled by the combination of techniques which relax data dependences. In this paper, we evaluate the limits of parallelism of the TLSP architecture by using abstract machine models. We have three major results. First, if we use solely each technique which relaxes data dependences, "renaming" has a large effect on the TLSP architecture. Second, combinatorial use of "memory disambiguation" and "renaming" leads to huge parallelism. Third, constant effects are obtained by concurrent use of "value prediction" and other techniques.
Kazuaki MURAKAMI Hidetaka MAGOSHI
This paper briefly surveys architectural technologies of recent or future high-performance, low-power processors for improving the performance and power/energy consumption simultaneously. Achieving both high performance and low power at the same time imposes a lot of challenges on processor design, and therefore gives us a lot of opportunities for devising new technologies. The paper also tries to provide some insights into the technology direction in future.
Takatsugu ONO Koji INOUE Kazuaki MURAKAMI Kenji YOSHIDA
This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integrated technology. System-in-Silicon is one of the architectural frameworks to realize the high bandwidth. An ASIC and a specific SRAM are mounted onto a silicon interposer. Each chip is connected to the silicon interposer by eutectic solder bumps. In the framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve the performance. We exploit the cache to reduce the DRAM energy consumption. During application program executions, an adequate cache line size which produces the lowest cache miss ratio is varied because the amount of spatial locality of memory references changes. If we employ a large cache line size, we can expect the effect of prefetching. However, the DRAM energy consumption is larger than a small line size because of the huge number of banks are accessed. The SC-VLS cache is able to change a line size to an adequate one at runtime with a small area and power overheads. We analyze the adequate line size and insert line size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS cache reduces the DRAM energy consumption up to 88%, compared to a conventional cache with fixed 256 B lines.
Despite the prevalence of Java workloads across a variety of processor architectures, there is very little published data on the impact of the various processor design decisions on Java performance. We attribute the lack of data to the large design space resulting from the complexity of the modern superscalar processor and the additional complexities associated with executing Java bytecode using a virtual machine. To address this shortcoming, we use a statistically rigorous methodology to systematically quantify the the impact of the various processor microarchitecture parameters on Java execution performance. The adopted methodology enables efficient screening of significant factor effects in a large design space consisting of 35 factors (32-billion potential configurations) using merely 72 observations per benchmark application. We quantify and tabulate the significance of each of the 35 factors for 13 benchmark applications. While these tables provide various insights into Java performance, they consistently highlight the performance significance of the instruction delivery mechanism, especially the instruction cache and the ITLB design parameters. Furthermore, these tables enable the architect to identify processor bottlenecks for Java workloads by providing an estimate of the relative impact of various design decisions.
Mariko SAKAMOTO Akira KATSUNO Go SUGIZAKI Toshio YOSHIDA Aiichiro INOUE Koji INOUE Kazuaki MURAKAMI
Broadcast and synchronization techniques are used for cache coherence control in conventional larger scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs allows us an opportunity to achieve higher performance. Though the penalty of synchronization is counted whenever a cache miss occurs, it is necessary to improve the coherence method to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits into DSM enterprise server systems. Originally, a directory-based method was adopted in high performance computing systems because of its huge scalability in comparison with snoop-based method. Though directory capacity miss and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers is advantageous to us to solve these problems along with an advanced LSI technology. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.
Makoto SUGIHARA Taiga TAKATA Kenta NAKAMURA Ryoichi INANAMI Hiroaki HAYASHI Katsumi KISHIMOTO Tetsuya HASEBE Yukihiro KAWANO Yusuke MATSUNAGA Kazuaki MURAKAMI Katsuya OKUMURA
We propose a cell library development methodology for throughput enhancement of character projection equipment. First, an ILP (Integer Linear Programming)-based cell selection is proposed for the equipment for which both of the CP (Character Projection) and VSB (Variable Shaped Beam) methods are available, in order to minimize the number of electron beam (EB) shots, that is, time to fabricate chips. Secondly, the influence of cell directions on area and delay time of chips is examined. The examination helps to reduce the number of EB shots with a little deterioration of area and delay time because unnecessary directions of cells can be removed. Finally, a case study is shown in which the numbers of EB shots are shown for several cases.
Hiroshi KATAOKA Hiroaki HONDA Farhad MEHDIPOUR Nobuyuki YOSHIKAWA Akira FUJIMAKI Hiroyuki AKAIKE Naofumi TAKAGI Kazuaki MURAKAMI
The single flux quantum (SFQ) is expected to be a next-generation high-speed and low-power technology in the field of logic circuits. CMOS as the dominant technology for conventional processors cannot be replaced with SFQ technology due to the difficulty of implementing feedback loops and conditional branches using SFQ circuits. This paper investigates the applicability of a reconfigurable data-path (RDP) accelerator based on SFQ circuits. The authors introduce detailed specifications of the SFQ-RDP architecture and compare its performance and power/performance ratio with those of a graphics-processing unit (GPU). The results show at most 1600 times higher efficiency in terms of Flops/W (floating-point operations per second/Watt) for some high-performance computing application programs.